Group 1: Silu Wang, Jiacheng Qiao
Supervisor: Zhi He
Gun violence has long been a public safety issue in the US. Every year, a large number of people are injured or killed in shooting incidents in New York City. Analyzing and learning from historic incident data is therefore useful both for police departments and for society as a whole. The dataset for this project is a collection of every shooting incident that occurred in NYC from 2006 through May 2021, including information about the victim, the perpetrator, the location, and the time of occurrence.
This project provides a descriptive summary of when and where shooting incidents frequently happened, along with an overall picture of the perpetrators and the victims. The goal is to give NYC residents a summary of shooting activity, including hotspots and frequency, and to help the NYPD classify perpetrators' age groups based on the information at hand.
import class_utility as c
import eda_utility as u
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report
In the first part, we clean the dataset in several steps to prepare it for EDA and modeling. Specific steps include:
df_ori = pd.read_csv('files/NYPD_Shooting_Incident_Data__Historic_.csv')
df = c.clean_dataset(df_ori)
df.head()
With the cleaned dataset, we can conduct exploratory data analysis to gain insight into the nature of shooting incidents in New York City. Questions we tried to answer include:
Plotting monthly incidents for each year shows clearly that more shooting incidents happened in the summer than in other months. Overall, the frequency of shootings increased from spring to summer, decreased from September onward, and reached its lowest point in February.
Broken down by day of the week, more shootings happened on Saturday and Sunday. Broken down by time of day, evening and midnight are the peak hours.
# EDA
# Exploring month frequency
u.month_plotting(df)
# Exploring time period and weekday frequency
u.time_weekday_plotting(df)
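The counts behind the monthly and weekday plots can be reproduced with plain pandas. The following is only a minimal sketch, not the implementation inside `eda_utility`: the helper name `count_by_month_and_weekday` is hypothetical, and it assumes the cleaned data still carries the raw `OCCUR_DATE` column in `MM/DD/YYYY` format.

```python
import pandas as pd


def count_by_month_and_weekday(incidents, date_col="OCCUR_DATE"):
    """Return (incidents per month for each year, incidents per weekday).

    Hypothetical helper; the column name and date format are assumptions.
    """
    dates = pd.to_datetime(incidents[date_col], format="%m/%d/%Y")

    # Rows are years, columns are month numbers, values are incident counts.
    monthly = (
        incidents.groupby([dates.dt.year.rename("year"),
                           dates.dt.month.rename("month")])
        .size()
        .unstack("month", fill_value=0)
    )

    # Number of incidents per weekday name (Monday, Tuesday, ...).
    weekday = dates.dt.day_name().value_counts()
    return monthly, weekday


# Tiny made-up sample, for illustration only (not real incident data).
sample = pd.DataFrame(
    {"OCCUR_DATE": ["07/04/2020", "07/05/2020", "02/01/2020", "07/10/2021"]}
)
monthly, weekday = count_by_month_and_weekday(sample)
print(monthly)
print(weekday)
```

Applied to the full dataset, `monthly` gives one row per year, which is the shape needed to draw one line per year across the twelve months.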
According to the plots of age distribution by gender, males far outnumber females among both victims and perpetrators. Victims fall mostly in the 25-44 age group, while perpetrators skew younger, mostly in the 18-24 age group.
u.age_distribution(df)
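The numbers underlying an age-by-gender comparison can be tabulated with `pd.crosstab`. This is a sketch under assumptions, not the code behind `u.age_distribution`: the helper `age_by_sex` is hypothetical, and the column names `VIC_AGE_GROUP` and `VIC_SEX` are assumed to survive cleaning unchanged.

```python
import pandas as pd


def age_by_sex(incidents, age_col, sex_col):
    """Cross-tabulate age group (rows) against sex (columns).

    Hypothetical helper; column names are passed in by the caller.
    """
    return pd.crosstab(incidents[age_col], incidents[sex_col])


# Made-up sample rows, for illustration only (not real incident data).
sample = pd.DataFrame(
    {
        "VIC_AGE_GROUP": ["25-44", "25-44", "18-24", "<18"],
        "VIC_SEX": ["M", "M", "F", "M"],
    }
)
victims = age_by_sex(sample, "VIC_AGE_GROUP", "VIC_SEX")
print(victims)
```

The same call with the perpetrator columns (assumed here to be `PERP_AGE_GROUP` and `PERP_SEX`) would give the matching table for perpetrators.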
The three interactive maps below let us explore the shooting hotspots around New York City. To avoid overplotting, we drew only the incidents from the most recent five years on the hotspot map.
The maps show a clear boundary in Manhattan at 97th Street, which splits the island into two distinct areas. Upper Manhattan is far more dangerous than the central and lower areas in terms of the number of shooting incidents. The choropleth map also shows that Brooklyn saw the most shootings in the past decade.
nyc_location = [40.71, -74.00]
hotspot_map = u.hotspot_map(df, nyc_location)
hotspot_map
cluster_map = u.cluster_map(df, nyc_location)
cluster_map
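The five-year restriction used for the hotspot map could be applied as a simple date filter before plotting. The sketch below is an assumption about how that might be done, not the logic inside `u.hotspot_map`: the helper `recent_incidents` is hypothetical, and the `OCCUR_DATE`, `Latitude`, and `Longitude` column names are assumed.

```python
import pandas as pd


def recent_incidents(incidents, years=5, date_col="OCCUR_DATE"):
    """Keep only rows from the `years` years before the latest incident.

    Hypothetical helper; the column name and date format are assumptions.
    """
    dates = pd.to_datetime(incidents[date_col], format="%m/%d/%Y")
    cutoff = dates.max() - pd.DateOffset(years=years)
    return incidents[dates > cutoff]


# Made-up sample rows, for illustration only (not real incident data).
sample = pd.DataFrame(
    {
        "OCCUR_DATE": ["05/31/2021", "06/15/2016", "05/31/2016", "01/01/2010"],
        "Latitude": [40.81, 40.67, 40.70, 40.85],
        "Longitude": [-73.94, -73.91, -73.95, -73.90],
    }
)
recent = recent_incidents(sample)
print(recent)
```

Measuring the window from the latest incident in the data, rather than from today's date, keeps the result stable for a dataset that ends in May 2021.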